Source: https://archive.ics.uci.edu/ml/datasets/Spambase

Data Set Information:

The "spam" concept is diverse: advertisements for products/web sites, make money fast schemes, chain letters, pornography...

Our collection of spam e-mails came from our postmaster and individuals who had filed spam. Our collection of non-spam e-mails came from filed work and personal e-mails, and hence the word 'george' and the area code '650' are indicators of non-spam. These are useful when constructing a personalized spam filter. One would either have to blind such non-spam indicators or get a very wide collection of non-spam to generate a general purpose spam filter.

Attribute Information:

The last column of 'spambase.data' denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail. Most of the attributes indicate whether a particular word or character was frequently occurring in the e-mail. The run-length attributes (55-57) measure the length of sequences of consecutive capital letters. For the statistical measures of each attribute, see the end of this file. Here are the definitions of the attributes:

48 continuous real [0,100] attributes of type word_freq_WORD = percentage of words in the e-mail that match WORD, i.e. 100 * (number of times the WORD appears in the e-mail) / total number of words in e-mail. A "word" in this case is any string of alphanumeric characters bounded by non-alphanumeric characters or end-of-string.
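
The word_freq_WORD definition above can be sketched in a few lines. This is only an illustration: the function name word_freq, the sample e-mail, and the case-insensitive matching are assumptions, since the dataset description does not say how case was handled.

```python
import re


def word_freq(text, word):
    """Percentage of words in `text` that match `word`."""
    # A "word" is any maximal run of alphanumeric characters.
    words = re.findall(r"[A-Za-z0-9]+", text)
    if not words:
        return 0.0
    # Assumption: matching ignores case; the source does not specify this.
    matches = sum(1 for w in words if w.lower() == word.lower())
    return 100.0 * matches / len(words)


email = "Make money fast! Free money for George."
print(word_freq(email, "money"))  # 2 of 7 words, about 28.57
```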

6 continuous real [0,100] attributes of type char_freq_CHAR = percentage of characters in the e-mail that match CHAR, i.e. 100 * (number of CHAR occurrences) / total characters in e-mail
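
The char_freq_CHAR attributes follow the same pattern at character level. A minimal sketch, assuming that "total characters" includes whitespace and punctuation (the description does not say); char_freq and the sample string are hypothetical.

```python
def char_freq(text, char):
    """Percentage of characters in `text` that equal `char`."""
    if not text:
        return 0.0
    # Assumption: every character, including spaces, counts toward the total.
    return 100.0 * text.count(char) / len(text)


email = "WIN $$$ NOW!"
print(char_freq(email, "$"))  # 3 of 12 characters
```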

1 continuous real [1,...] attribute of type capital_run_length_average = average length of uninterrupted sequences of capital letters

1 continuous integer [1,...] attribute of type capital_run_length_longest = length of longest uninterrupted sequence of capital letters

1 continuous integer [1,...] attribute of type capital_run_length_total = sum of length of uninterrupted sequences of capital letters = total number of capital letters in the e-mail
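
The three capital_run_length attributes can all be derived from the same list of run lengths. The sketch below is an assumption about how they might be computed: it treats only the ASCII letters A-Z as capitals and returns zeros for text without capitals, a case the [1,...] ranges above do not cover. The name capital_run_stats is hypothetical.

```python
import re


def capital_run_stats(text):
    """Return (average, longest, total) length of uninterrupted capital-letter runs."""
    # Each match is one uninterrupted sequence of capital letters.
    runs = [len(run) for run in re.findall(r"[A-Z]+", text)]
    if not runs:
        # Assumption: no capitals at all yields zeros.
        return 0.0, 0, 0
    total = sum(runs)
    return total / len(runs), max(runs), total


average, longest, total = capital_run_stats("FREE offer, CLICK NOW")
print(average, longest, total)  # runs are FREE, CLICK, NOW
```

Because the total is the sum of all run lengths, it equals the number of capital letters in the text, as the definition states.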

1 nominal {0,1} class attribute of type spam = denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail.

There is a small class imbalance: 1813 e-mails are labelled as spam and 2788 as non-spam.
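
To put the imbalance in numbers, the class shares can be worked out directly from the two counts. The majority-class baseline is added here as a reference point: it is the accuracy a model would get by always predicting non-spam.

```python
spam, non_spam = 1813, 2788
total = spam + non_spam

spam_share = spam / total
# Accuracy of a trivial model that always predicts the larger class.
majority_baseline = max(spam, non_spam) / total

print(f"{total} e-mails, {spam_share:.1%} spam, baseline accuracy {majority_baseline:.1%}")
```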

It seems like the columns have low correlation.
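
For reference, this is the Pearson correlation coefficient that such a pairwise check between columns typically relies on. The sketch uses toy columns, not Spambase values, and pearson is a hypothetical helper; whether the notebook used Pearson correlation is an assumption.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Toy columns for illustration only.
col_a = [1, 2, 3, 4]
col_b = [2, 4, 6, 8]    # perfectly correlated with col_a
col_c = [1, -1, -1, 1]  # uncorrelated with col_a

print(pearson(col_a, col_b), pearson(col_a, col_c))
```

Values close to 0 for most column pairs would support the observation that the columns have low correlation.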

Lasso

These are R2 scores, so let's compare them to the R2 score of the predictions.

The accuracy score is very good, but the cross-validation results show some underfitting. That is not the case for the test score, though.
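
Since Lasso is a regression model, R2 and accuracy measure different things: R2 is computed on the continuous predictions, while accuracy needs them turned into class labels first. A minimal sketch with toy values, assuming a 0.5 threshold (how the accuracy above was actually obtained is not stated); both helper names are hypothetical.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def accuracy_at_threshold(y_true, y_pred, threshold=0.5):
    """Share of correct labels after thresholding continuous predictions."""
    # Assumption: predictions at or above the threshold count as spam (1).
    labels = [1 if p >= threshold else 0 for p in y_pred]
    return sum(int(l == t) for l, t in zip(labels, y_true)) / len(y_true)


# Toy values for illustration only.
y_true = [1, 0, 1, 0]
y_pred = [0.8, 0.3, 0.6, 0.4]

print(r2_score(y_true, y_pred), accuracy_at_threshold(y_true, y_pred))
```

In this toy case every thresholded label is correct even though R2 is only about 0.55, so a modest R2 does not rule out a high accuracy.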

Decision Tree

The best value for the minimum number of observations per tree leaf is 18, and the best complexity value is 0.0. The accuracy score is not much better than that of the lasso regression. It might improve with further hyperparameter tuning, but the results are already pretty good.

Random Forest

There is no sign of underfitting or overfitting, and the accuracy score is consistent across all samples.

However, it is interesting that the original RandomForestClassifier() with default parameters scored better.

Stochastic Gradient Boosting

:(

An accuracy score of 0.956 is reached. There is still some overfitting, so feature selection could be useful.

Let's see the confusion matrix of the best model.

The numbers of false positives and false negatives are both relatively low; however, the model produces a higher percentage of false positives.
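
One way to read "percentage" here is as error rates taken from the confusion matrix: the false positive rate over all non-spam e-mails and the false negative rate over all spam e-mails. The sketch below uses toy labels, not the notebook's results, and the helper names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 means spam."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1  # non-spam flagged as spam
        elif t == 0 and p == 0:
            tn += 1
        else:
            fn += 1  # spam that slipped through
    return tn, fp, fn, tp


def error_rates(tn, fp, fn, tp):
    """False positive rate and false negative rate."""
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr


# Toy labels for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)
print(counts, error_rates(*counts))
```

For a spam filter, false positives are usually the more costly error, because a legitimate e-mail ends up hidden from the user.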